```{r setup}
knitr::opts_chunk$set(echo = TRUE)
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
library(tidytext)
library(textstem)
library(clinspacy)
library(topicmodels)
```

This practical covers exploratory data analysis, named entity recognition, and topic modelling of unstructured free-text medical notes derived from electronic medical records (EMR).  Real EMR data is very difficult to access without a specific need/request, so this dataset is derived from [medical transcription](https://mtsamples.com/) data instead.  I'll also caveat that the natural language processing (NLP) options in R are far inferior to those available in Python.

First, install the packages in the setup block (`install.packages(c("readr", "dplyr", "tidyr", "ggplot2", "scales", "tidytext", "textstem", "clinspacy", "topicmodels"))`).

Note: To make it clearer which library each function comes from, I'll try to use explicit namespace prefixes (e.g., `dplyr::filter()`) throughout this notebook.

## Data Parsing

After that we can grab the dataset directly from the `clinspacy` library.

```{r}
raw.data <- clinspacy::dataset_mtsamples()
dplyr::glimpse(raw.data)
```

There is no explanation or data dictionary with this dataset, which is a surprisingly common and frustrating turn of events!  

**1** Using the output of dplyr's `glimpse` command (or RStudio's data viewer, opened by clicking on `raw.data` in the Environment pane), provide a description of what you think each column in this dataset contains.


Let's see how many different medical specialties are featured in these notes: 
```{r}

raw.data %>% dplyr::select(medical_specialty) %>% dplyr::n_distinct()
```

So, how many transcripts are there from each specialty:


```{r}
ggplot2::ggplot(raw.data, ggplot2::aes(y=medical_specialty)) + ggplot2::geom_bar()
```

Let's make our life easier and filter down to 3 specialties: a diagnostic/lab, a medical, and a surgical specialty.

```{r} 
analysis.data <- raw.data %>% dplyr::filter(medical_specialty %in% c("Neurology", "Radiology", "Neurosurgery")) 
```

## Text Processing

Let's now apply our standard pre-processing to the transcripts from these specialties.  
We are going to use the `tidytext` package to tokenise the transcript free-text.  
By default this tokenises to words but other options include characters, n-grams, sentences, lines, paragraphs, or separation around a regular expression.

```{r}
tokenized.data <- analysis.data %>% tidytext::unnest_tokens(word, transcription, to_lower=TRUE)
```
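
As a small illustration of the other tokenisation options mentioned above, the sketch below splits a made-up two-note data frame into bigrams using `token = "ngrams"` and `n = 2`. The `toy.notes` data and the `bigram` column name are invented for this example and are not part of the MTSamples data.

```{r}
# Toy data invented for illustration; not part of the MTSamples dataset
toy.notes <- tibble::tibble(
  note_id = c(1, 2),
  transcription = c("Patient denies chest pain.", "MRI of the brain was normal.")
)

# Tokenise into overlapping pairs of adjacent words (bigrams);
# text is lower-cased and punctuation is dropped by default
toy.bigrams <- tidytext::unnest_tokens(toy.notes, bigram, transcription,
                                       token = "ngrams", n = 2)

toy.bigrams
```

Each row of the result is one bigram, with the other columns (here `note_id`) carried along, so the usual `dplyr` verbs apply in the same way as for single-word tokens.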

How many unique tokens are there in the transcripts from each specialty:

```{r}
tokenized.data %>% dplyr::group_by(medical_specialty) %>% dplyr::distinct(word) %>% dplyr::summarise(n=dplyr::n())
```

However, there are a lot of extremely common words e.g., "the", "of", "to", and so forth.  
These are known as stop words and we can remove them relatively easily using the list in `tidytext::stop_words` and `dplyr::anti_join()`.

**2** How many stop words are there in `tidytext::stop_words`?

```{r}
no.stop.tokenized.data <- tokenized.data %>% dplyr::anti_join(tidytext::stop_words, by = "word")
```

**3** How many unique words are there in each category without stop words and numbers?

Sometimes we are interested in tokenising/segmenting things other than words like whole sentences or paragraphs.  
**4** How many unique sentences are there in each category? Hint: use `?tidytext::unnest_tokens` to see the documentation for this function.

Now that we've tokenized to words and removed stop words, we can find the most commonly used words within each category:

```{r}
no.stop.tokenized.data %>%
  dplyr::group_by(medical_specialty) %>%
  dplyr::count(word, sort = TRUE) %>%
  dplyr::slice_max(n, n = 5)
```

We should lemmatize the tokenized words to prevent over counting of similar words before further analyses.  
Annoyingly, `tidytext` doesn't have a built-in lemmatizer.

**5** Do you think a general purpose lemmatizer will work well for medical data? Why or why not?

Unfortunately, a specialised lemmatizer like the one in `clinspacy` is going to be very painful to install, so we will just use a simple lemmatizer for now:

```{r}
lemmatized.data <- no.stop.tokenized.data %>% dplyr::mutate(lemma=textstem::lemmatize_words(word))
```
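
To get a feel for what a dictionary-based lemmatizer does, the sketch below runs `textstem::lemmatize_words()` on a handful of hand-picked words. The word list is invented for illustration, and the result depends on the lemma dictionary that `textstem` uses by default, so it is worth inspecting the output rather than assuming it.

```{r}
# Hand-picked examples: three general-English words and three clinical terms
toy.words <- c("running", "studies", "was", "arteries", "metastases", "b.i.d")
toy.lemmas <- textstem::lemmatize_words(toy.words)

# Side-by-side view of each word and the lemma assigned to it
data.frame(word = toy.words, lemma = toy.lemmas)
```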

We can now calculate the relative frequency of each lemma within each specialty.
```{r}
lemma.freq <- lemmatized.data %>% 
  dplyr::count(medical_specialty, lemma) %>%
  dplyr::group_by(medical_specialty) %>% 
  dplyr::mutate(proportion = n / sum(n)) %>%
  dplyr::ungroup() %>%
  # drop the raw count so each lemma occupies a single row after pivoting
  dplyr::select(-n) %>%
  tidyr::pivot_wider(names_from = medical_specialty, values_from = proportion) %>%
  tidyr::pivot_longer(`Neurosurgery`:`Radiology`,
               names_to = "medical_specialty", values_to = "proportion")
```

And plot the relative proportions:
```{r}

ggplot2::ggplot(lemma.freq, ggplot2::aes(x=proportion, 
                                         y=`Neurology`,
                                         color=abs(`Neurology` - proportion))) + 
  ggplot2::geom_abline(color="gray40", lty=2) +
  ggplot2::geom_jitter(alpha=0.1, size=2.5, width=0.3, height=0.3) +
  ggplot2::geom_text(ggplot2::aes(label=lemma), check_overlap=TRUE, vjust=1.5) +
  ggplot2::scale_x_log10(labels=scales::percent_format()) + 
  ggplot2::scale_y_log10(labels=scales::percent_format()) + 
  ggplot2::scale_color_gradient(limits=c(0, 0.001), low="darkslategray4", high="gray75") +
  ggplot2::facet_wrap(~medical_specialty, ncol = 2) +
  ggplot2::theme(legend.position="none") +
  ggplot2::labs(y="Neurology", x = NULL)
```
**6** What does this plot tell you about the relative similarity of lemma frequencies between neurosurgery and neurology and between radiology and neurosurgery? Based on what these specialties involve, is this what you would expect?

**7** Modify the above plotting code to do a direct comparison of Neurosurgery and Radiology (i.e., have Neurosurgery or Radiology on the Y-axis and the other 2 specialties as the X facets)

### TF-IDF Normalisation

Raw lemma proportions can be misleading because lemmas that are common in every specialty dominate them. What if we instead weight each lemma by how specific it is to a specialty, treating each specialty as a single document?

```{r}
lemma.counts <- lemmatized.data %>% dplyr::count(medical_specialty, lemma)
total.counts <- lemma.counts %>% 
                      dplyr::group_by(medical_specialty) %>% 
                      dplyr::summarise(total=sum(n))

all.counts <- dplyr::left_join(lemma.counts, total.counts)
```
Now we can calculate the term frequency-inverse document frequency (tf-idf):

```{r}
all.counts.tfidf <- tidytext::bind_tf_idf(all.counts, lemma, medical_specialty, n) 
```
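
To make the calculation concrete, here is a minimal sketch that computes tf-idf by hand on invented toy counts and compares the result with `tidytext::bind_tf_idf()`. The `toy.counts` data and the specialty labels "A" and "B" are made up for illustration. The sketch assumes the definitions tf = (count of a lemma in a document) / (total count in that document) and idf = ln(number of documents / number of documents containing the lemma).

```{r}
library(dplyr)  # for %>%; already attached by the setup chunk

# Toy counts invented for illustration: two "documents" (specialties), three lemmas
toy.counts <- tibble::tibble(
  medical_specialty = c("A", "A", "B", "B"),
  lemma = c("patient", "tumor", "patient", "image"),
  n = c(8, 2, 6, 4)
)

n.docs <- dplyr::n_distinct(toy.counts$medical_specialty)

toy.manual <- toy.counts %>%
  dplyr::group_by(medical_specialty) %>%
  dplyr::mutate(tf = n / sum(n)) %>%                 # share of the document's tokens
  dplyr::group_by(lemma) %>%
  dplyr::mutate(idf = log(n.docs / dplyr::n_distinct(medical_specialty))) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(tf_idf = tf * idf)

toy.tidytext <- tidytext::bind_tf_idf(toy.counts, lemma, medical_specialty, n)

# TRUE if the manual calculation agrees with tidytext
all.equal(toy.manual$tf_idf, toy.tidytext$tf_idf)
```

Note that "patient" appears in both toy documents, so its idf is ln(2/2) = 0 and its tf-idf is zero regardless of how often it occurs. With only three specialties acting as documents in the real data, any lemma used in all three is zeroed out in the same way.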

We can then look at the top 10 lemmas by tf-idf within each specialty:

```{r}
all.counts.tfidf %>% dplyr::group_by(medical_specialty) %>% dplyr::slice_max(order_by=tf_idf, n=10)
```
**8** Are there any lemmas that stand out in these lists? Why?

We can use `stringr::str_detect` to find transcriptions containing these unusual lemmas and check how they are used:
```{r}
analysis.data %>% dplyr::select(medical_specialty, transcription) %>% dplyr::filter(stringr::str_detect(transcription, stringr::fixed('b.i.d'))) %>% dplyr::slice(1)
```

**9** Extract an example of one of the other unusual "top lemmas" by modifying the above code

## Topic Modelling

In NLP, we often have collections of documents (in our case EMR transcriptions) that we’d like to divide into groups so that we can understand them separately. Topic modeling is a method for unsupervised classification of such documents, similar to clustering on numeric data.

Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language.


- Every document is a mixture of topics. We imagine that each document may contain words from several topics in particular proportions. For example, in a two-topic model we could say “Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B.”


- Every topic is a mixture of words. For example, we could imagine a two-topic model of American news, with one topic for “politics” and one for “entertainment.” The most common words in the politics topic might be “President”, “Congress”, and “government”, while the entertainment topic may be made up of words such as “movies”, “television”, and “actor”. Importantly, words can be shared between topics; a word like “budget” might appear in both equally.

LDA is a mathematical method for estimating both of these at the same time: finding the mixture of words that is associated with each topic, while also determining the mixture of topics that describes each document. There are a number of existing implementations of this algorithm, and we’ll explore one of them in depth.
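
The two principles above can be sketched as a tiny generative simulation in base R. Everything here is invented for illustration: the five-word vocabulary, the word probabilities in `topic.word`, and the 90%/10% mixture in `doc.topic`. It is also a simplification, since full LDA draws these distributions from Dirichlet priors rather than fixing them by hand. Fitting an LDA model amounts to running this process in reverse: given only the words, estimate the topic mixtures and the word distributions.

```{r}
set.seed(42)

# Hypothetical two-topic model over a tiny invented vocabulary
vocab <- c("president", "congress", "budget", "movies", "actor")
topic.word <- rbind(
  politics      = c(0.40, 0.35, 0.20, 0.03, 0.02),
  entertainment = c(0.02, 0.03, 0.20, 0.40, 0.35)
)
colnames(topic.word) <- vocab  # each row is a topic's distribution over words

# "Document 1" from the text: 90% topic A (politics), 10% topic B (entertainment)
doc.topic <- c(politics = 0.9, entertainment = 0.1)

# Generate a 20-word document: draw a topic for each word position,
# then draw the word itself from that topic's word distribution
n.words <- 20
word.topics <- sample(names(doc.topic), n.words, replace = TRUE, prob = doc.topic)
toy.document <- vapply(
  word.topics,
  function(topic) sample(vocab, 1, prob = topic.word[topic, ]),
  character(1)
)

table(word.topics)   # how many words came from each topic
table(toy.document)  # resulting word counts, the only thing LDA gets to see
```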

First, let's calculate a document-term matrix from the lemma counts in each transcription:
```{r}

lemma.counts <- lemmatized.data %>% dplyr::count(note_id, lemma)
total.counts <- lemma.counts %>% 
                      dplyr::group_by(note_id) %>% 
                      dplyr::summarise(total=sum(n))

all.counts <- dplyr::left_join(lemma.counts, total.counts)

emr.dcm <- all.counts %>% tidytext::cast_dtm(note_id, lemma, n)
```
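
To see what `tidytext::cast_dtm()` produces, here is a minimal sketch on invented counts (`toy.note.counts` is made up for illustration). It turns one-row-per-note-per-lemma counts into a sparse matrix with one row per note and one column per lemma, which is the input format `topicmodels::LDA()` expects.

```{r}
# Toy counts invented for illustration: two notes, three distinct lemmas
toy.note.counts <- tibble::tibble(
  note_id = c(1, 1, 2, 2, 2),
  lemma = c("pain", "mri", "pain", "tumor", "mri"),
  n = c(3, 1, 1, 2, 4)
)

toy.dtm <- tidytext::cast_dtm(toy.note.counts, note_id, lemma, n)

dim(toy.dtm)        # number of documents, number of terms
as.matrix(toy.dtm)  # dense view; lemmas absent from a note get a count of 0
```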

Then we can use the `topicmodels::LDA()` function to fit a 3-topic (`k=3`) LDA model:
```{r}
emr.lda <- topicmodels::LDA(emr.dcm, k=3, control=list(seed=42))
emr.topics <- tidytext::tidy(emr.lda, matrix='beta')
```

Then we can extract the top terms per assigned topic:
```{r}

top.terms <- emr.topics %>% dplyr::group_by(topic) %>% 
  dplyr::slice_max(beta, n=10) %>%
  dplyr::ungroup() %>%
  dplyr::arrange(topic, -beta)


top.terms %>% 
  dplyr::mutate(term=tidytext::reorder_within(term, beta, topic)) %>% 
  ggplot2::ggplot(ggplot2::aes(beta, term, fill=factor(topic))) + 
    ggplot2::geom_col(show.legend=FALSE) + 
    ggplot2::facet_wrap(~ topic, scales='free')  +
    tidytext::scale_y_reordered()
```


Now we can ask how well these assigned topics match up with the medical specialties from which each of these transcripts was derived.

```{r}
specialty_gamma <- tidytext::tidy(emr.lda, matrix='gamma')

# we need to join in the specialty from the note_id
note_id_specialty_mapping <- lemmatized.data %>%
  dplyr::mutate(document=as.character(note_id)) %>% 
  dplyr::select(document, medical_specialty) %>% 
  dplyr::distinct()

specialty_gamma <- dplyr::left_join(specialty_gamma, note_id_specialty_mapping)
```

```{r}

specialty_gamma %>%
  dplyr::mutate(medical_specialty = reorder(medical_specialty, gamma * topic)) %>%
  ggplot2::ggplot(ggplot2::aes(factor(topic), gamma)) +
  ggplot2::geom_boxplot() +
  ggplot2::facet_wrap(~ medical_specialty) +
  ggplot2::labs(x = "topic", y = expression(gamma))
```

Interestingly, neurosurgery notes are assigned mostly to a single topic, whereas radiology and neurology notes are spread across topics.  We might expect this for radiology, since imaging is requested for many different diagnoses and reasons.
However, this may simply reflect that we are using too few topics in our LDA to capture the range of possible assignments.

**10** Repeat this with a 6-topic LDA. Do the top terms from the 3-topic LDA still turn up? How do the specialties get split into sub-topics?

## Credits

Examples draw heavily on material (and directly quote/copy text) from Julia Silge and David Robinson's `tidytext` textbook, *Text Mining with R*.